Coincident change in pairs of variables
Correlation measures the extent to which two variables change in a coordinated fashion.
It does not imply that the variables are functionally linked or that one causes the other.
Mr. Nicolas Cage
# A tibble: 6 × 3
Year `Nicolas Cage Movies` `Drowning Deaths in Pools`
<int> <dbl> <dbl>
1 1999 2 109
2 2000 2 102
3 2001 2 102
4 2002 3 98
5 2003 1 85
6 2004 1 95
Testing for coincident changes in both sets of data.
Pearson's product-moment correlation
data: df$`Nicolas Cage Movies` and df$`Drowning Deaths in Pools`
t = 2.6785, df = 9, p-value = 0.02527
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.1101273 0.9045101
sample estimates:
cor
0.6660043
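A minimal sketch of how a test like this could be run, assuming only the six rows printed above. The output shown has df = 9, so it was computed on 11 years of data, and the numbers from this sketch will differ. The short column names `movies` and `drownings` are introduced here for convenience.

```r
# Only the six rows printed above; the full data set covers more years.
cage <- data.frame(
  Year      = 1999:2004,
  movies    = c(2, 2, 2, 3, 1, 1),
  drownings = c(109, 102, 102, 98, 85, 95)
)

# Pearson correlation test (the default method for cor.test)
res <- cor.test(cage$movies, cage$drownings)

res$estimate   # sample correlation
res$parameter  # degrees of freedom, N - 2
res$p.value
```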
100 Distinct Styles (not just IPAs & that yellow American Corn Lager!)
Global & Regional Styles
Quantitative Characteristics
Qualitative Characteristics
The BJCP Style Guidelines exist for beer, mead, and ciders.
Not including sour beers, which use yeast and bacteria mixtures.
The more sugar in the wort, the more food for the yeast to work on, and the more alcohol that may be produced.
The difference between the gravities before and after fermentation can be used to estimate ABV.
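As a rough illustration, one widely used homebrewing approximation multiplies the gravity drop by 131.25. That constant and the helper name `estimate_abv` are assumptions brought in here, not something taken from the slides.

```r
# Common approximation (an assumption, not from the slides):
# ABV (%) is roughly (OG - FG) * 131.25
estimate_abv <- function(og, fg) {
  (og - fg) * 131.25
}

# Original gravity 1.050 fermented down to a final gravity of 1.010
estimate_abv(og = 1.050, fg = 1.010)
```

The function is vectorized, so it could be applied to whole columns such as `OG_Min` and `FG_Min` to see how closely the approximation tracks the published ABV ranges.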
Bitterness comes mainly from the addition of hops and is measured in International Bitterness Units (IBU).
The color of the beer is quantified using the Standard Reference Method (SRM) scale.
Here are the raw characteristic data for the different styles.
Styles Yeast ABV_Min ABV_Max
Length:100 Ale :69 Min. :2.400 Min. : 3.200
Class :character Either: 4 1st Qu.:4.200 1st Qu.: 5.475
Mode :character Lager :27 Median :4.600 Median : 6.000
Mean :4.947 Mean : 6.768
3rd Qu.:5.500 3rd Qu.: 8.000
Max. :9.000 Max. :14.000
IBU_Min IBU_Max SRM_Min SRM_Max
Min. : 0.00 Min. : 8.00 Min. : 2.00 Min. : 3.00
1st Qu.:15.00 1st Qu.: 25.00 1st Qu.: 3.50 1st Qu.: 7.00
Median :20.00 Median : 35.00 Median : 8.00 Median :17.00
Mean :21.97 Mean : 38.98 Mean : 9.82 Mean :17.76
3rd Qu.:25.00 3rd Qu.: 45.00 3rd Qu.:14.00 3rd Qu.:22.00
Max. :60.00 Max. :120.00 Max. :30.00 Max. :40.00
OG_Min OG_Max FG_Min FG_Max
Min. :1.026 Min. :1.032 Min. :0.998 Min. :1.006
1st Qu.:1.040 1st Qu.:1.052 1st Qu.:1.008 1st Qu.:1.012
Median :1.046 Median :1.060 Median :1.010 Median :1.015
Mean :1.049 Mean :1.065 Mean :1.009 Mean :1.016
3rd Qu.:1.056 3rd Qu.:1.075 3rd Qu.:1.010 3rd Qu.:1.018
Max. :1.080 Max. :1.130 Max. :1.020 Max. :1.040
Plenty of options in the RVA!
Estimating the real value created by the entire population of entities.
Mean of the real population, \(\mu\).
Variance of the real population, \(\sigma^2\)
The values we get by sampling from the much larger population in order to make inferences about it
Sample mean, \(\bar{x}\)
Sample variance, \(s^2\)
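The distinction between parameters and sample statistics can be sketched with a simulated "population". Everything here (the population size, its mean and standard deviation, and the sample size of 30) is an arbitrary choice for illustration.

```r
set.seed(42)

# A large simulated stand-in for the "real" population
population <- rnorm(100000, mean = 5, sd = 1.5)

# Parameters: computed over every member of the population
mu     <- mean(population)
sigma2 <- mean((population - mu)^2)   # population variance divides by N

# Statistics: computed from a small sample drawn from that population
smp   <- sample(population, size = 30)
x_bar <- mean(smp)
s2    <- var(smp)                     # sample variance divides by n - 1

c(mu = mu, x_bar = x_bar, sigma2 = sigma2, s2 = s2)
```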
Much of the way we determine the significance of a model is based upon assumptions of the underlying data.
Normality
Independence
Homoscedasticity
In general, the data we work with are assumed to follow a Normal distribution with parameters \(\mu\) and \(\sigma\), often denoted as \(N(\mu,\sigma)\) (as you did with rnorm(n, mean, sd) in a previous homework), whose density is:
\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2} \]
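The Normal density squares the standardized distance \((x - \mu)/\sigma\) inside the exponent. A quick way to check a hand-written version is to compare it with R's built-in `dnorm()`; the helper name `normal_pdf` is made up for this sketch.

```r
# Normal density written out by hand
normal_pdf <- function(x, mu = 0, sigma = 1) {
  1 / (sigma * sqrt(2 * pi)) * exp(-0.5 * ((x - mu) / sigma)^2)
}

x <- seq(-3, 3, by = 0.5)

# Should agree with the built-in density function
all.equal(normal_pdf(x, mu = 0, sigma = 1), dnorm(x, mean = 0, sd = 1))
all.equal(normal_pdf(x, mu = 2, sigma = 0.5), dnorm(x, mean = 2, sd = 0.5))
```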
Visualizing the ‘normality’ of the data using built-in functions.
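Two built-in plots that are commonly used for this are a histogram and a Normal quantile-quantile plot. The sketch below uses simulated data as a stand-in, since the slides do not show which variable or which functions were plotted.

```r
set.seed(1)
x <- rnorm(100, mean = 5, sd = 1)   # simulated stand-in for a real variable

# Histogram: look for a roughly symmetric, bell-shaped distribution
hist(x, main = "Histogram", xlab = "x")

# Q-Q plot: points close to the reference line suggest normality
qq <- qqnorm(x)
qqline(x, col = "red")
```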
The Shapiro-Wilk test for normality has the null hypothesis \(H_0: Data\;are\;normal\), using the W statistic.
\(W = \frac{\left(\sum_{i=1}^N a_i x_{(i)}\right)^2}{\sum_{i=1}^N(x_i - \bar{x})^2}\)
where \(N\) is the number of samples, \(a_i\) is a standardizing coefficient, \(x_i\) is the \(i^{th}\) value of \(x\), \(\bar{x}\) is the mean of the observed values, and \(x_{(i)}\) is the \(i^{th}\) smallest observation (the \(i^{th}\) order statistic).
The default test for this in R is performed by the shapiro.test() function. Here, we will look at the minimum ABV value from the beer dataset.
Shapiro-Wilk normality test
data: beer$ABV_Min
W = 0.94595, p-value = 0.0004532
Is Minimum ABV Normal?
Fractions and percentages are known to behave poorly, particularly around the edges (e.g., close to 0 or 1). It is not uncommon to use a simple arcsine square-root transformation to try to help fractional data.
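A minimal sketch of the transformation, assuming the percentages are first rescaled to fractions between 0 and 1. The slides do not show how `abv.1` was built, so this is only one plausible construction; the five values are the quartile summaries of `ABV_Min` printed earlier.

```r
# Min, quartiles, and max of ABV_Min from the summary table (percent)
abv_pct <- c(2.4, 4.2, 4.6, 5.5, 9.0)

# Rescale to fractions, then apply the arcsine square-root transformation
abv_frac <- abv_pct / 100
abv_asin <- asin(sqrt(abv_frac))

abv_asin

# The transformation is invertible: sin(.)^2 recovers the fractions
all.equal(sin(abv_asin)^2, abv_frac)
```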
Shapiro-Wilk normality test
data: abv.1
W = 0.96746, p-value = 0.01418
Conclusion? Are these data normal?
The Box-Cox family of transformations can be used to see whether a data set can be pushed towards normality for parametric analyses.
\[ \tilde{x} = \frac{x^\lambda - 1}{\lambda}, \quad \lambda \ne 0 \]
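To get a feel for the family, it helps to look at a few values of \(\lambda\). With \(\lambda = 1\) the data are only shifted, and as \(\lambda\) approaches 0 the transformation approaches \(\log(x)\), which is the conventional definition at \(\lambda = 0\). The helper name `box_cox` and the example values are made up for this sketch.

```r
# Box-Cox transformation, using log(x) for the lambda = 0 case
box_cox <- function(x, lambda) {
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

x <- c(0.5, 1, 2, 5, 10)   # the data must be strictly positive

box_cox(x, 1)     # a pure shift: x - 1
box_cox(x, 0.5)   # a square-root-like compression of large values
box_cox(x, 0)     # log(x)

# A very small lambda is numerically close to log(x)
max(abs(box_cox(x, 1e-6) - log(x)))
```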
# Try a range of Box-Cox lambdas and record the Shapiro-Wilk result for each.
# lambda = 0 is not in the default grid; it would need log(x) instead.
test_boxcox <- function( x, lambdas = seq(-1.1, 1.1, by = 0.015) ) {
  ret <- data.frame( Lambda = lambdas,
                     W = NA,
                     P = NA )
  for( i in seq_along(lambdas) ) {
    # Box-Cox transform of the data for this lambda
    x.tilde <- (x^lambdas[i] - 1) / lambdas[i]
    w <- shapiro.test( x.tilde )
    ret$W[i] <- w$statistic
    ret$P[i] <- w$p.value
  }
  return( ret )
}
It is assumed that the variance of the data are
How you collect your samples and design your experiments determines whether your observations are independent of one another. Think about this very carefully at the design stage.
\(\rho = \frac{\sum_{i=1}^N(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^N(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^N(y_i - \bar{y})^2}}\)
whose values are confined to be within the range \(-1.0 \le \rho \le +1.0\)
The significance of the sample estimate \(r\) is tested using a \(t\) statistic with \(N-2\) degrees of freedom:
\(t = r \sqrt{\frac{N-2}{1-r^2}}\)
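The correlation and its test statistic can be computed by hand and compared against `cor.test()`. This sketch uses simulated data, and it takes the statistic to be \(t = r\sqrt{(N-2)/(1-r^2)}\) on \(N-2\) degrees of freedom, which is what `cor.test()` reports for the Pearson method.

```r
set.seed(123)
x <- rnorm(50)
y <- 0.6 * x + rnorm(50, sd = 0.8)   # simulated, correlated with x
N <- length(x)

# Pearson correlation from the definition
r <- sum((x - mean(x)) * (y - mean(y))) /
  (sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2)))

# t statistic and two-sided p-value on N - 2 degrees of freedom
t_stat <- r * sqrt((N - 2) / (1 - r^2))
p_val  <- 2 * pt(abs(t_stat), df = N - 2, lower.tail = FALSE)

# Compare with the built-in test
fit <- cor.test(x, y)
all.equal(r, unname(fit$estimate))
all.equal(t_stat, unname(fit$statistic))
all.equal(p_val, fit$p.value)
```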
Figure 1: Data and associated correlation statistics.
As we’ve used several times so far, the cor.test() function performs simple correlation analysis, with a default Pearson Product Moment analysis.
Pearson's product-moment correlation
data: beer$OG_Max and beer$FG_Max
t = 15.168, df = 98, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.7671910 0.8878064
sample estimates:
cor
0.8374184
Again, the object that is returned is a list and has these components.
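One way to inspect such an object is with `class()` and `names()`, after which individual pieces can be pulled out with `$`. The data below are simulated purely for illustration.

```r
set.seed(7)
x <- rnorm(30)
y <- x + rnorm(30)

fit <- cor.test(x, y)

class(fit)   # "htest", the standard class for test results in R
names(fit)   # the components stored in the list

# Individual components can be extracted directly
fit$estimate    # the correlation
fit$statistic   # the t statistic
fit$parameter   # the degrees of freedom
fit$p.value
fit$conf.int    # the 95 percent confidence interval
```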
To alleviate some of the underlying parametric assumptions, we can use ranks of the data instead of the raw data directly.
\(\rho_{Spearman} = \frac{ \sum_{i=1}^N(R_{x_i} - \bar{R_{x}})(R_{y_i} - \bar{R_{y}})}{\sqrt{\sum_{i=1}^N(R_{x_i} - \bar{R_{x}})^2}\sqrt{\sum_{i=1}^N(R_{y_i} - \bar{R_{y}})^2}}\)
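In other words, the Spearman correlation is the Pearson correlation applied to the ranks. A short check on simulated data (the variables here are arbitrary):

```r
set.seed(11)
x <- rexp(40)                       # skewed, clearly non-normal
y <- x^2 + rnorm(40, sd = 0.5)      # monotone but non-linear relationship

# Pearson correlation of the ranks
rho_by_hand <- cor(rank(x), rank(y))

# Built-in Spearman correlation
rho_builtin <- cor(x, y, method = "spearman")

all.equal(rho_by_hand, rho_builtin)
```

The same option is available for the test itself through `cor.test(x, y, method = "spearman")`.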
Another way to circumvent some of the constraints imposed by the form of the data is to use permutation to test significance.
\(H_0: \rho = 0\)
Permutation requires that we do some simple simulation work, permuting the data assuming the Null Hypothesis is TRUE.
Now, we can go through the 999 NA values we put into that data frame and:
1. Permute one of the variables
2. Run the analysis
3. Store the statistic.
Shuffle the y variable and recalculate the test statistic 999 times.
Probability of a value as extreme as, or more extreme than, the original estimate \(\to\) P-value.
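The permutation steps can be sketched as follows. This is a minimal version on simulated data: the variable names, the data frame layout (the observed estimate in the first row followed by 999 `NA` slots), and the two-sided comparison on absolute values are assumptions made here, not necessarily the slides' exact code.

```r
set.seed(2024)
x <- rnorm(40)
y <- 0.5 * x + rnorm(40)   # simulated data with a real association

n_perm <- 999

# Row 1 holds the observed estimate; the rest are filled in by permutation
perm <- data.frame(Estimate = c(cor(x, y), rep(NA, n_perm)))

for (i in seq_len(n_perm)) {
  y_perm <- sample(y)                     # 1. permute one of the variables
  perm$Estimate[i + 1] <- cor(x, y_perm)  # 2. run the analysis, 3. store it
}

# Two-sided P-value: the fraction of values (observed included) at least
# as extreme as the observed estimate
p_value <- mean(abs(perm$Estimate) >= abs(perm$Estimate[1]))
p_value
```

Because the observed estimate is counted among the 1000 values, the smallest P-value this procedure can return is 1/1000.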
Measures of correlation determine the co-movement of two variables without making any statement regarding causation.